Abstract: In this paper we tackle the problem of efficient
video event detection. We argue that linear detection functions
should be preferred in this regard due to their scalability and
efficiency during estimation and evaluation. A popular approach in
this regard is to represent a sequence using a bag of words (BOW)
representation due to: (i) its fixed dimensionality irrespective of
the sequence length, and (ii) its ability to compactly model the
statistics in the sequence. A drawback to the BOW representation,
however, is the intrinsic destruction of the temporal ordering
information. In this paper we propose a new representation that
leverages the uncertainty in relative temporal alignments between
pairs of sequences while not destroying temporal ordering. Our
representation, like BOW, is of a fixed dimensionality making
it easily integrated with a linear detection function. Extensive
experiments on CK+, 6DMG, and UvA-NEMO databases show
significant performance improvements across both isolated and
continuous event detection tasks.

I. Introduction 

A popular strategy for learning a discriminative event
detection function, $f(\mathbf{X}; \theta) : \mathbb{R}^{D \times M} \to \mathbb{R}$, is to employ a
linear function,

$$f(\mathbf{X}; \theta) = \phi\{\mathbf{X}\}^{T} \theta \qquad (1)$$

where $\phi\{\mathbf{X}\}$ is a vectorized feature representation of the
multi-dimensional event sequence $\mathbf{X} \in \mathbb{R}^{D \times M}$; $D$ is the
dimensionality of the signal; and $M$ is the number of frames.
This is in contrast to canonical methods for temporal detection
in vision such as hidden Markov models (HMMs) [1], latent
dynamic conditional random fields (LDCRFs) [2], time series
kernels [3], [4] and dynamic time-alignment kernels [5] which
have non-linear interactions between the model parameters, $\theta$,
and the feature representation, $\phi\{\mathbf{X}\}$.

There are two central advantages for maintaining a linear
relationship between $\phi\{\mathbf{X}\}$ and $\theta$ in Equation (1). Firstly, the
linear form allows one to employ canonical max-margin linear
detectors such as linear support vector machines (SVM) [6]
or structural output SVMs (SO-SVM) [7] which generalize
well to high-dimensional discriminative learning problems.
Secondly, during detector evaluation one can take advantage of
efficient search strategies afforded to linear detectors (e.g. linear
convolution, summed area tables, etc.) making the application
of such detectors highly efficient.

Recently, [8], [9] demonstrated that state-of-the-art performance
in temporal event detection can be achieved using a bag
of words (BOW) representation of the temporal signal in conjunction
with an SVM-style detector. Specifically, the authors
compared their approach to canonical hidden state probabilistic
methods for event detection such as hidden Markov models
(HMMs), and demonstrated that their BOW+SVM method achieves
superior performance in terms of computation and accuracy
by a considerable margin. A drawback, however, to the BOW
representation lies in the destruction of the temporal dynamics
in the raw signal, $\mathbf{X}$. It is the preservation of this temporal
ordering information that is at the heart of this paper.

Contributions: We make the following contributions in this
paper,

• We propose a novel strategy for learning the relative
alignment uncertainty between pairs of training sequences
using an adaptation of dynamic time warping
(DTW). Using this model of uncertainty we then
propose a new representation which is an efficient
linear transform of the raw input sequence which: (i)
preserves temporal ordering information while averaging
over alignment uncertainty, and (ii) ensures the
representation is of a fixed dimensionality so as to be
applicable within a linear event detection function.

• We demonstrate that our approach has comparable
computational cost to current state-of-the-art BOW
linear detectors, but with the advantage of obtaining
significantly better detection performance across
the CK+, 6DMG, and UvA-NEMO event detection
datasets.

We evaluate the proposed approach on three datasets for
both isolated and continuous event detection, and demonstrate
improved performance while retaining computational
efficiency. The remainder of this paper is structured as follows:
Section II presents an overview of existing literature, in particular
the bag of words representation, dynamic time warping
and time series kernels; Section III presents our proposed
approach and in Section IV we outline the features that we
use in the proposed method; Section V evaluates our proposed
approach; and Section VI concludes the paper.

II. Background 
A. Bag of Words Representation 

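Bag of words (BOW) representations can be viewed as
simply taking the mean over all frames of a non-linear representation
$\eta\{\mathbf{x}_m\}$, where $\mathbf{x}_m$ is the $m$-th frame vector of
$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_M]$, such that,

$$\phi\{\mathbf{X}\} = \frac{1}{M} \sum_{m=1}^{M} \eta\{\mathbf{x}_m\}. \qquad (2)$$

The non-linear function obtains a sparse encoding of the frame
vector, $\mathbf{x}$, using the codebook matrix $\mathbf{D} \in \mathbb{R}^{D \times K}$, where $K$
is the number of codebook entries. The codebook is typically
learned through k-means clustering. We can define this non-linear
function as,

$$\eta\{\mathbf{x}\} = \arg\min_{\mathbf{b}} \| \mathbf{x} - \mathbf{D}\mathbf{b} \|_2^2 \quad \text{s.t.} \quad \mathbf{b} \in \mathcal{B} \qquad (3)$$

where $\mathcal{B} = \{\mathbf{e}_k\}_{k=1}^{K}$ is the non-convex set of all $K$ dimensional
vectors, $\mathbf{e}_k$, containing all zeros except for a one at the $k$-th
entry.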
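The encoding in Equations (2) and (3) can be illustrated with a minimal NumPy sketch. The function names `encode_frames` and `bow` are hypothetical and not taken from the paper, and the codebook is assumed to be given (for example from k-means); frames are stored as columns, matching the $D \times M$ layout used in the text.

```python
import numpy as np

def encode_frames(X, codebook):
    """Hard-assign every frame to its nearest codeword (Equation 3).

    X        : D x M array, one frame per column.
    codebook : D x K array, one codeword per column (assumed already learned).
    Returns a K x M one-hot matrix whose m-th column is eta{x_m}.
    """
    # Squared Euclidean distance between every codeword and every frame (K x M).
    d2 = ((X[:, None, :] - codebook[:, :, None]) ** 2).sum(axis=0)
    nearest = d2.argmin(axis=0)
    H = np.zeros((codebook.shape[1], X.shape[1]))
    H[nearest, np.arange(X.shape[1])] = 1.0
    return H

def bow(X, codebook):
    """Mean of the one-hot codes over all frames (Equation 2)."""
    return encode_frames(X, codebook).mean(axis=1)
```

The result is a K dimensional normalized histogram, so its length does not depend on the number of frames M.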

An initial question one may ask is why destroy the temporal
ordering information in $\mathbf{X}$? One obvious motivation stems
from the realization that the vectorized dimensionality of $\mathbf{X}$
will vary as a function of $M$, whereas $\phi\{\mathbf{X}\}$ is invariant
to $M$. The fixed dimensionality of $\phi\{\mathbf{X}\}$ allows for training
with canonical linear geometric classifiers such as linear SVM
and structural output SVM. The inevitable information loss
stemming from taking the multi-dimensional average over
all frames is somewhat mitigated by the application of the
non-linear mapping in Equation (3). Without the non-linear mapping,
one would simply be learning a detector model $\theta$ from the
multi-dimensional mean of $\mathbf{X}$ across frames. By encoding $\mathbf{X}$
non-linearly the destruction of information is not quite as
severe, with higher-order statistical moments being preserved
(i.e. $\phi\{\mathbf{X}\}$ can be interpreted as a multidimensional histogram
feature).

Cost of Search: Another advantage of the BOW representation
is that since temporal ordering information is destroyed
in Equation (2), searching over variable size window widths
becomes computationally efficient through the judicious use
of a summed area table (commonly referred to as the integral
image [10] in computer vision). In this strategy, once we have
applied the non-linear transform in Equation (3) to all frames
in a sequence, one can then obtain a cumulative sum of the
sequence, at a cost of $O(MK)$, and then obtain the BOW
representation for any sub-window at a cost of only $O(K)$
operations. The summed area table method can only be employed
for sequence representations such as BOW where temporal ordering
is destroyed. The major computational drawback to the
BOW representation is that the cost of mapping from $\mathbf{X} \to \eta\{\mathbf{X}\}$
is $O(MKD)$ using a naive codebook search.
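The cumulative-sum trick described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the helper names `cumulative_table` and `window_bow` are hypothetical, and the per-frame one-hot codes are assumed to have been computed already.

```python
import numpy as np

def cumulative_table(H):
    """Summed area table over time for K x M one-hot codes.

    Returns a K x (M + 1) array whose column m holds the sum of the first
    m code vectors; building it costs O(MK).
    """
    zeros = np.zeros((H.shape[0], 1))
    return np.concatenate([zeros, np.cumsum(H, axis=1)], axis=1)

def window_bow(table, start, end):
    """BOW vector of frames start .. end-1 (end exclusive) in O(K)."""
    return (table[:, end] - table[:, start]) / (end - start)
```

Any sub-window histogram is obtained from two table columns, which is why the approach only works when the representation ignores the order of frames inside the window.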

B. Dynamic Time Warping 

A number of works have been proposed in the literature for
temporal alignment [11], [12]. In this work we use dynamic
time warping (DTW) due to its established performance on
temporal alignment tasks.

Let us assume we have two multi-dimensional
sequences, $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_M]$ and $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$,
of equal dimensionality $D$ (i.e. $\mathbf{x} \in \mathbb{R}^{D}$ and $\mathbf{y} \in \mathbb{R}^{D}$) but
differing frame lengths, $M$ and $N$ respectively. We would
like to temporally align these two sequences based on some
distance metric. For our purposes this will be the Euclidean
distance. Dynamic time warping (DTW) can be applied to
align the two signals, and this can be expressed as solving,

$$\mathrm{DTW}(\mathbf{X}, \mathbf{Y}) = \min_{\pi_x, \pi_y} \sum_{t=1}^{T} \| \mathbf{X}[\pi_x(t)] - \mathbf{Y}[\pi_y(t)] \|_2^2 \qquad (4)$$

where $\pi_x$ and $\pi_y$ are integer index vectors with the constraints
that $1 = \pi_x(1) \le \ldots \le \pi_x(T-1) \le \pi_x(T) = M$
and $1 = \pi_y(1) \le \pi_y(2) \le \ldots \le \pi_y(T-1) \le \pi_y(T) = N$
with unitary increments and no simultaneous repetitions. The
length $T$ of the index vectors $\pi_x$ and $\pi_y$ is bounded by
$T \le M + N - 1$. For all elements of $\pi_x$ and $\pi_y$ we define the
increment $\mathbf{r}$ such that

$$\mathbf{r}(p) = \begin{bmatrix} \pi_x(p+1) - \pi_x(p) \\ \pi_y(p+1) - \pi_y(p) \end{bmatrix} \qquad (5)$$

is constrained to the set of 3 causal moves $\rightarrow$, $\uparrow$ and $\nearrow$,

$$\mathbf{r}(p) \in \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\}. \qquad (6)$$

It is the constraint of the causal moves defined in Equation (6)
that makes an efficient solution to the DTW objective in
Equation (4) possible. Specifically, the causal constraints imply
a tree-structure which can be solved efficiently through belief
propagation (i.e. Viterbi decoding) with a cost of $O(MND)$.
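The dynamic program behind this cost can be sketched in a few lines. The sketch below is a textbook DTW, not the authors' code: the function name `dtw` is hypothetical, the squared Euclidean distance between frames is assumed as the local cost, and indices are 0-based rather than 1-based.

```python
import numpy as np

def dtw(X, Y):
    """Align X (D x M) with Y (D x N) using the three causal moves.

    Returns (cost, pi_x, pi_y) where pi_x and pi_y are 0-based index lists
    of equal length T <= M + N - 1.
    """
    M, N = X.shape[1], Y.shape[1]
    # Local cost between every pair of frames (M x N); this step is O(MND).
    C = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    A = np.full((M, N), np.inf)  # accumulated cost
    A[0, 0] = C[0, 0]
    for i in range(M):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            best = min(
                A[i - 1, j] if i > 0 else np.inf,
                A[i, j - 1] if j > 0 else np.inf,
                A[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            A[i, j] = C[i, j] + best
    # Backtrack from the end of both sequences to recover the alignment path.
    i, j = M - 1, N - 1
    pi_x, pi_y = [i], [j]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((A[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((A[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((A[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        pi_x.append(i)
        pi_y.append(j)
    return A[-1, -1], pi_x[::-1], pi_y[::-1]
```

Each step of the recovered path advances one sequence, the other, or both, so the two index lists are non-decreasing and never repeat simultaneously.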

DTW Warping Matrices: One can re-write the objective in
Equation (4) as,

$$\mathrm{DTW}(\mathbf{X}, \mathbf{Y}) = \min_{\mathbf{P}_x, \mathbf{P}_y \in \mathcal{P}} \| \mathbf{X}\mathbf{P}_x - \mathbf{Y}\mathbf{P}_y \|_F^2 \qquad (7)$$

where $\mathbf{P}_x$ and $\mathbf{P}_y$ are the $M \times T$ and $N \times T$ warping
matrices respectively stemming from the set $\mathcal{P}$ that enforce
causal deformations in time. Although unconventional, the
concept of expressing the warps stemming from DTW alignment
as deformation matrices is crucial later for our proposed
approach. Figure 1 shows four different examples with their
corresponding alignment paths.
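To make the matrix view concrete, the sketch below turns an alignment path into a binary warping matrix, so that multiplying a sequence by it reproduces the frame selection of Equation (4). The name `warp_matrix` is hypothetical, and the path is assumed to be a 0-based list of frame indices such as one produced by any DTW routine.

```python
import numpy as np

def warp_matrix(path, length):
    """Binary length x T matrix with a single one per column.

    Column t selects frame path[t], so X @ warp_matrix(path, M) equals
    X[:, path] for a D x M sequence X.
    """
    P = np.zeros((length, len(path)))
    P[path, np.arange(len(path))] = 1.0
    return P
```

With warping matrices for both sequences, the aligned cost is simply the squared Frobenius norm of the difference of the two warped sequences.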

C. Time Series Kernels 

Time series kernels have been gaining in popularity
for temporal classification and event detection [3], [4].
Recently, Lorincz et al. [4] proposed the idea of employing a
kernel SVM based on a time series kernel for event detection.
In this approach they proposed an event detection function as,

$$f(\mathbf{X}; \theta) = \sum_{l=1}^{L} \alpha_l \, k(\mathbf{X}, \mathbf{X}_l) \qquad (8)$$

where $\theta = \{\alpha_l, \mathbf{X}_l\}_{l=1}^{L}$ are the kernel SVM's model parameters,
specifically the $L$ support weights $\alpha_l$ (which have
the binary support labels subsumed within them) and support
vectors $\mathbf{X}_l$. The alignment kernel is defined as,

$$k(\mathbf{X}_i, \mathbf{X}_j) = \exp\{ -t \cdot \widetilde{\mathrm{DTW}}(\mathbf{X}_i, \mathbf{X}_j) \} \qquad (9)$$

where $t$ is a constant. The measure $\mathrm{DTW}()$ in Equation (7) is
not technically a distance (as it does not obey the triangle
inequality) so the authors propose projecting the result into
the closest symmetric positive semi-definite kernel $\widetilde{\mathrm{DTW}}()$.
Lorincz et al. also proposed various extensions and variations
to the DTW kernel, such as the Global Alignment (GA) kernel,
the details of which are outside the scope and focus of this
paper.

A real strength of this method is that it elegantly embraces 
the idea that alignment is a relative notion. Instead of trying 
to align all sequences to a single temporal frame of reference, 
the approach instead employs the notion of relative alignment 
between pairs of training examples. The authors reported state 
of the art event detection performance across a number of 
event detection datasets, also validating the importance of 
preserving temporal ordering information in any representation 
one employs for event detection. 

Computational Cost: Although achieving impressive empirical
performance, time series kernels cost $O(LMND)$ for
every window searched in a sequence. L is the number of 
support vectors, M is the length of the input sequence, N 
is the average length of the support vector sequences and D 
is the dimensionality of the sequences. Some work 0 has 
explored strategies for making these methods more efficient 
such as the employment of constrained DTWs [13], which
consider a smaller set of possible causal alignments. Even with 
these speed ups, the cost of evaluation is dramatically larger 
than most other event detection methods in current literature 
such as efficient BOW methods. Their strength, however, lies 
in their good empirical performance and the theoretical insight 
that temporal ordering is of high importance in event detection, 
and relative DTW alignment may be of service in effectively 
taking advantage of this redundancy. 

III. Proposed Approach 

In this paper we propose the employment of the following
linear representation,

$$\phi\{\mathbf{X}\} = \mathbf{X}\mathbf{P} \qquad (10)$$

where the matrix $\mathbf{P}$ is an $M \times T$ matrix that causally warps all
events into a common reference frame of length $T$ (irrespective
of the raw length $M$ of $\mathbf{X}$). $T$ is chosen to be larger than
the length of every training sequence. The central strength of
this representation is that, depending on the nature of $\mathbf{P}$, all
temporal ordering information is preserved. Further, the representation
is a linear transformation of the raw signal $\mathbf{X}$, circumventing
the sometimes costly non-linear mapping required
in canonical BOW representations. The open question with
this approach, however, is how to obtain the alignment
matrix $\mathbf{P}$.

A. Choosing P 

An obvious choice for $\mathbf{P}$ is simply an interpolation matrix
to transform any sequence $\mathbf{X}$ of varying frame length $M$ into
a fixed frame length sequence of $T$ frames. This warping
results in a homogeneous temporal stretching or squeezing. We
shall herein refer to this interpolation matrix as $\mathbf{P}^{*}_{M \times T}$, which
stretches a sequence of length $M$ to length $T$. In all our work,
we employ a linear interpolation although other interpolation
strategies can be entertained.
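One way such a linear interpolation matrix could be constructed is sketched below. The function name `interpolation_matrix` and the choice of sampling positions evenly spaced between the first and last frame are assumptions made for illustration; the paper does not specify these details.

```python
import numpy as np

def interpolation_matrix(M, T):
    """M x T matrix P such that X @ P linearly resamples M frames to T frames.

    Each column holds the two interpolation weights of one output frame.
    Assumption: output frames are evenly spaced between the first and last
    input frame.
    """
    P = np.zeros((M, T))
    if M == 1:
        P[0, :] = 1.0  # a single frame can only be repeated
        return P
    pos = np.linspace(0, M - 1, T)                 # fractional source positions
    lo = np.minimum(np.floor(pos).astype(int), M - 2)
    w = pos - lo                                   # weight of the right neighbour
    cols = np.arange(T)
    P[lo, cols] = 1.0 - w
    P[lo + 1, cols] = w
    return P
```

Because the operation is a matrix product, it can be combined with any later linear step, which is what the remainder of this section exploits.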

A drawback to this naive strategy, however, is that it is
almost always sub-optimal if one entertains the DTW set $\mathcal{P}$
of all causal deformation matrices discussed in Section II-B.
For example, one can nearly always find a superior alignment
between two sequences $\mathbf{X} \in \mathbb{R}^{D \times M}$ and $\mathbf{Y} \in \mathbb{R}^{D \times N}$ in terms
of their Frobenius norms, such that

$$\| \mathbf{X}\mathbf{P}^{*}_{M \times T} - \mathbf{Y}\mathbf{P}^{*}_{N \times T} \|_F^2 \ge \min_{\mathbf{P}_x, \mathbf{P}_y \in \mathcal{P}} \| \mathbf{X}\mathbf{P}_x - \mathbf{Y}\mathbf{P}_y \|_F^2 \qquad (11)$$

where $\mathcal{P}$ is the set of causal DTW matrices previously defined
in Equation (7). An issue, however, is that the notion of
alignment in Equation (11) is relative to $\mathbf{X}$ and $\mathbf{Y}$. It is difficult
to ascertain what $\mathbf{P}_x$ or $\mathbf{P}_y$ should be without knowing a priori
what sequence or sequences you are aligning against.

B. Learning P 

Inspired by the work of [3], [4] we propose a variation
upon our naive representation in Equation (10),

$$\phi\{\mathbf{X}\} = \frac{1}{|\mathcal{G}|} \sum_{\mathbf{P} \in \mathcal{G}} \mathbf{X}\mathbf{P}^{*}_{M \times T}\mathbf{P} = \mathbf{X}\mathbf{P}^{*}_{M \times T}\bar{\mathbf{P}} \qquad (12)$$

where $\mathcal{G}$ is a set of learned $T \times T$ temporal deformation
matrices. Instead of estimating an absolute alignment for $\mathbf{X}$,
our representation takes the expectation over the uncertainty
in absolute alignment encapsulated in the set $\mathcal{G}$. For
computational efficiency, the average $\frac{1}{|\mathcal{G}|} \sum_{\mathbf{P} \in \mathcal{G}} \mathbf{P} = \bar{\mathbf{P}}$
can be pre-computed, and $\mathbf{P}^{*}_{M \times T}$ is the linear interpolation
matrix that ensures the raw sequence $\mathbf{X}$ of length $M$ is always
of a fixed length $T$. We should note that we are not claiming $\bar{\mathbf{P}}$
itself to be a warping matrix (since it is the average of a
set of warping matrices which belong to a non-convex set).
Instead, $\bar{\mathbf{P}}$ should be just considered a pre-computation of the
averaging procedure described in Equation (12).
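The pre-computation argument rests only on linearity, which the following sketch checks numerically. Random matrices are used purely as stand-ins for the interpolation matrix and for the learned deformation matrices; none of the values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, T = 3, 7, 10

X = rng.standard_normal((D, M))
P_interp = rng.random((M, T))                # stand-in for the M x T interpolation matrix
G = [rng.random((T, T)) for _ in range(5)]   # stand-ins for learned T x T deformations

# Slow route: warp the sequence with every matrix in G, then average the results.
phi_slow = np.mean([X @ P_interp @ P for P in G], axis=0)

# Fast route: average the matrices once, then apply a single product.
P_bar = np.mean(G, axis=0)
phi_fast = X @ P_interp @ P_bar
```

Both routes give the same D x T representation, so the cost at test time does not grow with the number of matrices in the learned set.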

Learning G: We apply a simple but effective strategy for
learning the set $\mathcal{G}$ where we estimate the deformation matrices
through the DTW objective of Equation (7) for all pairs of positive
class sequence examples which we shall, for convenience,
simply refer to as $\mathbf{X} \in \mathbb{R}^{D \times M}$ and $\mathbf{Y} \in \mathbb{R}^{D \times N}$. Each pair
of sequences shall produce the $M \times T$ and $N \times T$ alignment
matrices $\mathbf{P}_x$ and $\mathbf{P}_y$ respectively (they are estimated in reverse
pairing as well). All these alignment matrices are collated into
the learned set

$$\mathcal{G} = \{ \mathbf{P}_l \mathbf{P}^{*}_{T_l \times T_{\max}} \}_{l=1}^{L} \qquad (13)$$

where $L$ is the total number of estimated deformation matrices
$\mathbf{P}_l$ across all pairs of positive class sequences, and $\mathbf{P}^{*}_{T_l \times T_{\max}}$ is
the linear interpolation matrix to scale all deformation matrices
to a common length, where $T_{\max} = \max\{T_l\}_{l=1}^{L}$ is chosen
to ensure that no temporal detail is lost. Figure 1 shows four
aligned pairs of videos and their corresponding warping matrix
$\mathbf{P}_l$.

Fig. 1: Example of four aligned sequences from two different databases and their corresponding alignment paths. Figure (a)-top
shows two aligned deliberate smile video frames and Figure (a)-bottom shows two aligned spontaneous smile video frames from
the UvA-NEMO database. Figure (b)-top shows two aligned AU-1 video frames and Figure (b)-bottom shows two aligned AU-12
video frames from the CK+ database.

Computational Cost: Unlike the time series kernel method
of [4] (see Section II-C) our proposed approach is computationally
efficient. Although we cannot take advantage of the
summed area table method of BOW representations, our linear
approach does not require any non-linear mappings. Further,
during evaluation one can actually pre-compute the application
of the warping matrices,

$$f(\mathbf{X}; \theta) = \mathrm{vec}\{\mathbf{X}\mathbf{P}^{*}_{M \times T}\bar{\mathbf{P}}\}^{T} \theta = \mathrm{vec}\{\mathbf{X}\}^{T} \theta_M, \qquad (14)$$

so that a number of $\theta_M \in \mathbb{R}^{DM \times 1}$ linear models of varying
window size, $M$, can be pre-computed from $\theta$ so as to
handle varying window sizes efficiently. The cost
of evaluating a single window is then $O(MD)$ which is
comparable to the cost of $O(KD)$ of applying the $K$ entry
codebook encoding to a new frame with a BOW representation.
Also, for faster detection we only evaluate those values of $M$
which are more likely to occur. Finally, for offline or buffered
applications this approach can also utilize efficient FFT based
convolutions in time to further decrease computational load.
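The identity in Equation (14) can be checked numerically with a short sketch. The matrix `P` below is a random stand-in for the product of the interpolation matrix and the averaged deformation matrix, and row-major flattening is used for vec{} on both sides, which is an implementation choice rather than something stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, T = 4, 6, 9

X = rng.standard_normal((D, M))
P = rng.random((M, T))              # stand-in for the combined M x T warp
theta = rng.standard_normal(D * T)  # linear model learned in the length-T frame

# Fold the warp into the model once per window size M.
Theta = theta.reshape(D, T)         # same row-major ordering as ravel() below
theta_M = (Theta @ P.T).ravel()     # DM-dimensional model for raw windows

score_direct = (X @ P).ravel() @ theta  # warp the window, then score it
score_fast = X.ravel() @ theta_M        # score the raw window in O(MD)
```

After the one-off construction of `theta_M`, scoring a window needs a single inner product of length DM, which is the O(MD) cost quoted above.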

Continuous Event Detection: Algorithm 1 shows our proposed
model for detecting events in continuous video, i.e.
detecting a particular event in an unknown sequence with
unknown starting and ending locations. We learn our model
for the continuous problem by using a structured output SVM
(SO-SVM) as presented in [7], because of its strengths in
continuous domains. For our SO-SVM we use the same model
as presented in [8], [9] for the loss function and the training
model.
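The search loop of Algorithm 1 can be sketched as a sliding-window correlation over a set of candidate window lengths. This is a simplified illustration, not the authors' implementation: the function name `detect` is hypothetical, the per-length templates are assumed to have been pre-computed (for example via Equation (14)), and scores from different window lengths are compared directly without any normalization.

```python
import numpy as np

def detect(X, theta_by_size):
    """Return (start, end, score) of the best-scoring window, end exclusive.

    X             : D x N continuous sequence.
    theta_by_size : dict mapping a window length M to a D x M template
                    (assumed pre-computed for every candidate length).
    """
    best = (None, None, -np.inf)
    N = X.shape[1]
    for M, Theta in theta_by_size.items():
        if M > N:
            continue
        # Linear score of every window of length M: correlate each feature
        # channel with its template row and sum over channels.
        scores = sum(
            np.correlate(X[d], Theta[d], mode="valid") for d in range(X.shape[0])
        )
        s = int(np.argmax(scores))
        if scores[s] > best[2]:
            best = (s, s + M, float(scores[s]))
    return best
```

For long buffered sequences the per-channel correlation could be replaced by an FFT based convolution, as mentioned in the computational cost discussion.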

Non-Linear Extensions: It becomes obvious that one can
apply a similar strategy for learning $\mathcal{G}$ to the non-linear
representation, $\eta\{\mathbf{X}\}$, of the codebook encoding function
described in Equation (3). The only additional computational
cost in testing is the $O(KD)$ cost of applying the $K$ entry
codebook encoding to a new frame.

IV. Feature extraction from video 
A. Feature Extraction 

There are two general approaches for video feature extraction,
shape-based [14], [15] and appearance-based [16], [17]
methods. Appearance-based methods share some
limitations due to changes in camera view, illumination
variations, and the speed of action. On the other hand, geometric
approaches follow the movement of some key parts
or points (for instance on a body or face) and try to capture
the temporal movement as a sequence of observations. In this
paper, we use shape to represent each video frame vector.
We use facial feature points and 6D comprehensive motion
data, including position, orientation, acceleration and angular
speed tracking for body gestures to build the observation data.
The facial points are tracked using Constrained Local Models
(CLM) [18]. After the facial components have been tracked,
a similarity transformation is applied to the facial features with
respect to the normal facial shape to eliminate variations
in scale, rotation and translation. Figure 2(b) shows an
example of facial landmark features in several frames of the
UvA-NEMO [19] video database.

Algorithm 1: Our Approach (Continuous Event Detection)

Input  : Input examples X, model parameter θ, event sizes M.
Output : Event start, event end

Initialize: X, θ, M
for j ∈ M do
    θ_j ← linearly interpolate {θ}
    score ← conv(X, θ_j)
end
{start, end} ← max(score)

B. Feature Encoding 

Shape features, X, are extracted from each frame as
described in Section IV-A and are encoded in one of three
ways.

Linear: refers to the raw feature representation, i.e. X is used
without any encoding.

Delta: refers to using a differential signal such that the feature
becomes X(n) - X(n - 1).

Non-Linear: refers to the raw representation being encoded
using a codebook function, $\eta\{\mathbf{X}\}$. We can also encode the
delta signal with the codebook function.

V. Evaluation 

This section describes our experiments on three publicly 
available databases, CK+ EO] UvA-NEMO Ql] and 6D 
Motion Gesture Database ED. We evaluate our proposed ap¬ 
proach for the detection of both isolated and continuous events. 
An overview of the databases is presented in Section V-A;
Section V-B details the experimental settings used; Section
V-C outlines the metrics we use to evaluate our approach;
Section V-D presents our results for isolated and continuous
event detection tasks; and Section V-E compares our proposed
approach with other state of the art methods.

A. Databases 

6D Motion Gesture Database: The 6DMG database contains
comprehensive motion data, including the 3D position,
orientation, acceleration, and angular speed for sets of different
motion gestures performed by different users. The database
contains three subsets: motion gestures, air-handwriting and
air-fingerwriting. In this work we used the air-handwriting set.
The WorldViz PPT-X4 optical tracking system was used to
track infra-red dots that were mounted at the top of a Wiimote.
Overall, the tracking device provided 6D spatio-temporal information,
including the position, orientation, acceleration and
angular speed. The database authors adjusted the scale of the 3D model to
make the rendered motion as close to the real-world action as
possible. This database contains 26 upper-case letters (A to Z)
for motion characters. Each character is repeated 10 times for
every subject. Sequences vary in duration between 27 and 412
frames. To eliminate allographs or different stroke orders, the
subjects were instructed to follow a certain "stroke order" for
each character (as is shown in Figure 2(a)).

UvA-NEMO Database: The UvA-NEMO database was collected
to analyse smiles. This database is composed of video
recorded with a Panasonic HDC-HS700 3MOS camcorder
placed approximately 1.5 meters away from subjects. The
database has 1240 smile videos in two classes, spontaneous
and posed (597 spontaneous and 643 posed) from 400 subjects
(185 female and 215 male). The age of subjects varies from
8 to 76 years. For posed smiles, each subject was asked
to pose a smile as realistically as possible. For spontaneous
smiles a short funny video was shown to each person to elicit
spontaneous smiles. Each sequence starts and ends in neutral or
near neutral expressions. Sequences vary in duration between
50 and 715 frames. To track the facial landmarks, we use the
recently proposed CLM method [18] to track 66 landmarks
from each face. All tracked facial feature points are registered
to a reference face by using a similarity transformation. Some
examples from this database are shown in Figure 2(b).



Fig. 2: (a) Example of "stroke order" for the 6DMG database. (b)
Some examples from the UvA-NEMO database. (c) Some examples
from the CK+ database.


CK+ Database: The CK+ Database is a facial expression
database. It contains 593 facial expression sequences from 123
participants. Each sequence starts from a neutral face and ends
at the peak frame. Sequences vary in duration between 4 and
71 frames, and the locations of 68 facial landmarks are provided
along with the database. Facial poses are frontal with slight head
motions. All the facial feature points are registered to a
reference face by using a similarity transformation. Examples
from this database are shown in Figure 2(c).

B. Experimental setup 

Training/Testing split: In our experiments, we use 5-fold
cross-validation to evaluate our approach. Approximately 80%
of instances in each database are used for training and the
remaining 20% are used for testing.

Isolated event detection task: In this task we choose three
common databases, 6DMG, UvA-NEMO and CK+, as presented
in Subsection V-A. For each sequence the start and
end points of the event of interest are known a priori. For
evaluation, we use a linear SVM and the LIBSVM [22] package.
We perform a standard grid-search on cross-validation to tune
parameters (including the C on the SVM).

Continuous event detection task: To test our proposed
method on the continuous problem we use the 6DMG database.
We consider detecting "A" in a word which is preceded and
followed by five random letters, "B" to "Z". The length of the
sequences varies from 981 to 1482 frames. In this case the start
and end points of the event of interest are unknown. We use
the 6DMG database for the continuous event detection problem
because it has longer videos compared to other databases. For
evaluation, we use SO-SVM (using the package*).
We perform a standard grid-search on the validation set to tune
parameters (including parameter C in SO-SVM).

Fig. 3: Graphs comparing the accuracy and F1-score on Non-Linear and Linear features with different P models. Delta corresponds
to the differential signal (X(n) - X(n - 1)).

Fig. 4: The mapping matrix P for seven different cases across the three databases (6DMG, CK+ and UvA-NEMO). The first
four columns show the proposed approach with different feature encodings (see Section IV-B); the fifth column shows P for the
Histogram case; and the last column shows P for the Identity case.

Number of temporal codebooks: For building the codebooks,
k-means clustering is used. In our experiments we perform
cross-validation to tune the number of temporal codebooks. In
this work we set 300 codebooks for 6DMG, 136 for CK+ and
1500 for the UvA-NEMO database for the original signal, and
in the case of the delta signal we set these values to 100, 30, and
500 respectively.


* available at: http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html


C. Evaluation metrics 

To evaluate the performance, we report the area under ROC
curve, and the maximum F1-score. The F1-score is defined
as $F_1 = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$, and conveys the balance between
the precision and recall. The F1-score is a better performance
measure than the area under ROC curve because the ROC
curve is designed to measure binary classification rather
than detection and fails to reflect the effect of the proportion
of the positive to negative samples.
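As an illustration of the maximum F1-score, the sketch below sweeps a decision threshold over the detector outputs and keeps the best value. The function name `max_f1` is hypothetical, and sweeping over the observed scores as thresholds is an assumption, since the paper does not describe how the maximum is located.

```python
import numpy as np

def max_f1(scores, labels):
    """Largest F1-score over all thresholds taken from the observed scores.

    scores : detector outputs, higher means more likely positive.
    labels : 1 for positive examples, 0 for negative examples.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best = 0.0
    for thr in np.unique(scores):
        pred = scores >= thr
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)
```

Unlike the ROC curve, precision depends on the number of false positives relative to true positives, so the measure changes when the proportion of negative samples grows.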

D. Results 

Figure 3 compares the performance of different configurations
of the proposed approach, reporting the average accuracy
and F1-score among all classes, and Figure 4 shows the variations
of P used in Equation (12) learned using the proposed approach
in Section III-B. We investigate the impact of different
feature encodings (see Section IV-B): "Linear" refers to using
the raw representation X; "Delta" refers to using the differential
signal X(n) - X(n - 1); and "Non-Linear" refers to using the
codebook encoding function $\eta\{\mathbf{X}\}$, which we also apply to the
differential signal. Figure 4 visualises the P matrices learned
through the DTW procedure described in Section III-B and
we also compare to two other representations: "Hist", where
all elements of P are set to unity; and "Eye", where P is
simply an identity matrix.

Method                    Computational time (s)    Area under ROC curve
BOW + SO-SVM [8], [9]     135.8995                  56.27
Our method + SO-SVM       81.3082                   58.30

TABLE I: Comparing our proposed approach (using a linear
encoding of the original signal) with methods of [8], [9] and
[4] on three databases. The table shows the area under ROC
curve.

                          Area under ROC curve
Method                    CK+      6DMG     UvA-NEMO
BOW + SVM [8], [9]        71.83    87.81    63.21
Lorincz et al. [4]        89.13    89.77    75.25
Our method + SVM          90.86    96.19    81.87

TABLE II: Comparing our proposed approach (using a linear
encoding of the original signal) with methods of [8], [9] and
[4] on three databases. The table shows the area under ROC
curve.

It is interesting to note that our representation, when
employing "Hist" for P in conjunction with a Non-Linear
representation $\eta\{\mathbf{X}\}$, is equivalent to the BOW representation
described in Equation (2). We can see that using the non-linear
representation with a histogram for P (i.e. BOW) performs
poorly. This is to be expected as the BOW representation
throws away all temporal information. On the other hand,
stretching the observations to a standard length (linear interpolation,
"Eye") shows better performance than using a
histogram, as this preserves some temporal ordering. As can
be seen from the graphs, our proposed model outperforms both
the BOW and naive interpolation methods. In this case learning
P from DTW alignment helps the model to preserve the
temporal ordering information. The results also show that using
the non-linear representation degrades performance across all
datasets and types of P matrix.

Table I shows performance for the continuous event
detection problem, and compares the run times and area under
ROC curve of our proposed method (using a linear encoding
of the original signal) from Section III-B with that of the BOW
method. The cost of search in our proposed model is much
less than using BOW (81.3 s against 135.9 s), while also achieving
better performance (an area under ROC curve of 58.30 against 56.27).
The run times shown in Table I are achieved using Matlab
implementations on an Intel i7 2.1 GHz dual core CPU with
16GB RAM.


                          F1-score
Method                    CK+      6DMG     UvA-NEMO
BOW + SVM [8], [9]        48.70    39.64    59.84
Lorincz et al. [4]        71.33    53.84    78.50
Our method + SVM          70.79    58.33    79.56

TABLE III: Comparing our proposed approach (using a linear
encoding of the original signal) with methods of [8], [9] and
[4] on three databases. The table shows the F1-score.


E. Comparing with other methods 

In this subsection we compare our method (using a linear
encoding of the original signal) with the state-of-the-art BOW
method [8], [9] and the time series kernel method of Lorincz
et al. [4].

The BOW method was proposed to tackle the problem
of action unit detection. [8], [9] compared their method with
a frame-based SVM approach and a dynamic method using
HMMs. They showed that a segment-based SVM classifier using
BOW feature vectors outperforms both a frame-based SVM
and an HMM with two or four states. The major difference between
the frame-based SVM and the segment-based one is that the former
classifies each frame independently while the latter considers
a collection of frames for prediction. We implement the segment-based
SVM using BOW proposed by [8], [9] and compare
it against our proposed approach introduced in Section III-B.
The area under ROC curve and F1-score for this comparison
are reported in Table II and Table III on the above-mentioned
databases. As shown, our approach significantly outperforms
the segment-based SVM.

We also compare our method against Lorincz et al. [4].
They proposed to use a time series kernel for event detection
and obtained state-of-the-art performance for expression classification.
As can be seen, our method outperforms [4]. We also
note that the computational cost for our proposed method is
$O(MD)$, however the computational complexity of using the time
series kernel in [4] is $O(LMND)$ where L is the number of
support vectors, M is the length of the input sequence, N is
the average length of the support vector sequences and D is
the dimensionality of the sequences.

VI. Conclusion 

In this paper we addressed the problem of event detection 
and presented a simple, yet efficient, approach. Our proposed 
algorithm preserves temporal ordering that is essential for the 
analysis of problems with a dynamic nature. In this approach, 
instead of aligning all sequences to a single temporal refer¬ 
ence, we employed the notion of relative alignment between 
pairs of training examples. This approach proved effective 
in our empirical evaluations and maintained ordering whilst 
preserving the discriminative characteristics of the problem. 
We also demonstrated how the proposed approach could be 
extended to tackle the problem of continuous event detection, 
and demonstrated efficient and accurate performance. 